Data Resources and Moffitt Digital Help Services

UC Berkeley Library

Statistics Undergraduate Student Association

Spring 2018

“An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.”

-John Tukey

Plan your Research with a Literature Review

http://www.lib.berkeley.edu/

http://scholar.google.com

http://guides.lib.berkeley.edu/all-guides

Structure and availability of data

Unit of Analysis: Aggregated or microdata? (counties/nations/households vs. individuals)

Geography: Is there a geographic component to your topic? (U.S., Sub-Saharan Africa, India)

Time Period: Do you want data for a specific time period? (1980-2000, 1930-1960)

Frequency: How often do you want measures for your variables? (every year, every ten years, monthly, quarterly)

Providers

Researchers: Are there people you know who are doing this kind of research?

Government Agencies: Is the request for official statistics or data that a government agency would be likely to collect and publish? (Department of Energy, CDC, Census Bureau)

NGO/IGOs: Are there councils or interest organizations devoted to the topic that might collect data independently? (World Bank, OECD)

Research Organizations: Would any specific research organizations be interested in the topic? (Pew, Roper, Gallup, ACLU)

Library Licensed Data Aggregators

Data Planet

Social Explorer

Policy Map

Statista

Data Repositories for Replication Data

Dataverse

ICPSR

APIs

https://libraries.mit.edu/scholarly/publishing/apis-for-scholarly-resources/
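
As a rough illustration of what working with a scholarly API can look like, the sketch below builds a search URL and pulls titles out of a JSON response. It assumes the Crossref REST API as the example service, and it assumes the response nests results under "message" and "items" with each title stored as a list; check both against the provider's documentation before relying on them. The function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Crossref's public works endpoint is used as the example service.
BASE_URL = "https://api.crossref.org/works"


def build_query_url(search_terms, rows=5):
    """Return a search URL for the given terms, limited to `rows` results."""
    return BASE_URL + "?" + urlencode({"query": search_terms, "rows": rows})


def extract_titles(response):
    """Return (DOI, first title) pairs from a parsed JSON response.

    Assumes the response shape {"message": {"items": [...]}}.
    """
    items = response.get("message", {}).get("items", [])
    return [(item.get("DOI", ""), (item.get("title") or [""])[0]) for item in items]


def fetch_json(url, timeout=10):
    """Download a URL and parse the body as JSON (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(url) to actually query the service.
    print(build_query_url("ufo sightings"))
```

Keeping the URL building and the response parsing in separate functions makes it possible to try the parsing on a saved response before sending any live requests.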

Scraping

https://en.wikipedia.org/wiki/UFO_sightings_in_the_United_States

Scraping with Python
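
A minimal sketch of table scraping in Python, using only the standard library. It parses a small inline HTML sample rather than the live Wikipedia page, so it runs without network access. The sample table and its column order (date, city, state, name) are assumptions made for illustration, and the class and function names are hypothetical; the real page's layout may differ and should be inspected first.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows hold only <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None


def scrape_sightings(html):
    """Return one dict per data row, assuming columns date, city, state, name."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [
        {"date": r[0], "state": r[2], "name": r[3]}
        for r in parser.rows
        if len(r) >= 4
    ]


# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>City</th><th>State</th><th>Name</th></tr>
  <tr><td>February 24, 1942</td><td>Los Angeles</td><td>California</td>
      <td><a href="#">Battle of Los Angeles</a></td></tr>
  <tr><td>June 24, 1947</td><td>Mount Rainier</td><td>Washington</td>
      <td>Kenneth Arnold UFO sighting</td></tr>
</table>
"""

if __name__ == "__main__":
    for record in scrape_sightings(SAMPLE_HTML):
        print(record)
```

Reading whole rows and then picking cells by position keeps each date, state, and name together, which avoids the misalignment that can occur when columns are extracted separately.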

Scraping with R

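library(rvest)   # read_html(), html_nodes(), html_text()
library(dplyr)   # provides the %>% pipe

# Download and parse the page
ufo <- read_html("https://en.wikipedia.org/wiki/UFO_sightings_in_the_United_States")

# Pull table cells by column position; these selectors depend on the page layout,
# so check that dates, states, and names still line up row by row
ufo_date <- html_nodes(ufo, 'td:nth-child(1)') %>% html_text()
ufo_date <- ufo_date[c(-1, -44)] # remove extra elements
ufo_state <- html_nodes(ufo, 'td:nth-child(3)') %>% html_text()
ufo_name <- html_nodes(ufo, 'td:nth-child(4)') %>% html_text()

# Combine the columns into one data frame
ufo_df <- data.frame(date = ufo_date, name = ufo_name, state = ufo_state)

head(ufo_df, n = 5)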
##                            date                        name      state
## 1 \nIndex of ufology articles\n        Cooper St. UFO crash   New York
## 2                    April 1997       Battle of Los Angeles California
## 3             February 24, 1942       Maury Island incident Washington
## 4                 June 21, 1947 Kenneth Arnold UFO sighting Washington
## 5                 June 24, 1947                                Montana

Miscellaneous Collections

https://vincentarelbundock.github.io/Rdatasets/datasets.html

https://github.com/caesar0301/awesome-public-datasets

Text-mining

http://guides.lib.berkeley.edu/text-mining

D-Lab, Library Data Lab, Statistics Department

Research appointments with Research Librarians

http://www.lib.berkeley.edu/help/research-appointments

Data Acquisition and Access Program (DAAP)

Berkeley Research Data Management

“Research Data Management helps researchers navigate the increasingly complex landscape of data planning, storage, and sharing”

http://researchdata.berkeley.edu/

Peer Consulting at Moffitt

Peer Consulting in collaboration with Division of Data Sciences

Reaching out